AITopics

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsMar-19-2026, 17:03:04 GMT

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded fine-grained textual description (referred to as rationale) that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g.

artificial intelligence, large language model, natural language, (7 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsFeb-18-2026, 14:46:28 GMT

ed1d3d4c64dc1b95332a8cde3f2a0bdf-Paper-Conference.pdf

moe block, multimodal llm, visual instruction, (12 more...)

Country:

North America > United States > California > Santa Clara County > San Jose (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report > Experimental Study (0.93)
Research Report > New Finding (0.93)

Industry:

Education (0.67)
Government > Regional Government > North America Government > United States Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsFeb-12-2026, 11:07:35 GMT

5e84e4413268b713f0d4a1b23a9dae57-Paper-Conference.pdf

large language model, machine learning, natural language, (17 more...)

Country:

Asia > China > Fujian Province > Xiamen (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Neural Information Processing SystemsFeb-11-2026, 13:57:38 GMT

3d158f054ff0cb83397367234899db07-Paper-Conference.pdf

large language model, machine learning, natural language, (19 more...)

Country:

North America > United States (0.45)
Europe > United Kingdom (0.28)
North America > Canada (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Transportation > Passenger (1.00)
Transportation > Infrastructure & Services (1.00)
Transportation > Ground > Road (1.00)
(16 more...)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(5 more...)

Neural Information Processing SystemsDec-24-2025, 21:07:07 GMT

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce MLLM-CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal large language models (MLLMs). MLLM-CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality.

large language model, mllm-compbench, natural language, (7 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.75)

Aivalis, Theodoros, Klampanos, Iraklis A., Troumpoukis, Antonis, Jose, Joemon M.

Training Data Attribution for Image Generation using Ontology-Aligned Knowledge Graphs

arXiv.org Artificial IntelligenceDec-3-2025

As generative models become powerful, concerns around transparency, accountability, and copyright violations have intensified. Understanding how specific training data contributes to a model's output is critical. We introduce a framework for interpreting generative outputs through the automatic construction of ontologyaligned knowledge graphs (KGs). While automatic KG construction from natural text has advanced, extracting structured and ontology-consistent representations from visual content remains challenging -- due to the richness and multi-object nature of images. Leveraging multimodal large language models (LLMs), our method extracts structured triples from images, aligned with a domain-specific ontology. By comparing the KGs of generated and training images, we can trace potential influences, enabling copyright analysis, dataset transparency, and interpretable AI. We validate our method through experiments on locally trained models via unlearning, and on large-scale models through a style-specific experiment. Our framework supports the development of AI systems that foster human collaboration, creativity and stimulate curiosity.

large language model, machine learning, natural language, (21 more...)

2512.02713

Country: Europe (0.93)

Genre: Research Report (0.82)

Industry: Law > Intellectual Property & Technology Law (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-27-2025

SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Choi, Tae-Min, Jeong, Tae Kyeong, Kim, Garam, Lee, Jaemin, Koh, Yeongyoon, Choi, In Cheul, Chung, Jae-Ho, Park, Jong Woong, Park, Juyoun

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

artificial intelligence, large language model, natural language, (17 more...)

2511.21339

Genre: Research Report (0.40)

Industry: Health & Medicine > Surgery (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.92)

arXiv.org Artificial IntelligenceNov-26-2025

Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

Guo, Muhao, Weng, Yang

Table I summarizes the datasets used for training and evaluation. Both baseline models and the PV AL framework were fine-tuned on 2,000 annotated tiles from Santa Ana, CA. The large-scale evaluation set includes about 100,000 tiles from Tempe and Santa Ana, while 480 tiles per region were used for cross-domain generalization tests across diverse climates and geographies. B. Multimodal LLM Configuration Configuring the PV AL system for solar panel detection involves a multi-faceted approach that integrates prompt engineering, output standardization, and supervised fine-tuning. This configuration is critical for steering the foundational GPT -4o model towards the specific, high-precision task of geospatial analysis. Prompt Task Decomposition Identify the presence of solar panels in images of residential rooftops, and determine their locations and quantity within the images. You will be provided with images that may contain residential rooftop solar systems. Analyze each image to detect solar panels. Steps: 1. ** Image Analysis **: Examine the entire image to identify any objects that appear to be solar panels.

large language model, machine learning, natural language, (18 more...)

2511.19537

Country:

Asia (0.96)
North America > United States > California > Orange County > Santa Ana (0.25)
North America > United States > Arizona > Maricopa County (0.14)

Genre: Research Report (0.64)

Industry: Energy > Renewable > Solar (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)

arXiv.org Artificial IntelligenceNov-17-2025

Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions

Hong, Mengze, Jiang, Di, Zhao, Weiwei, Li, Yawen, Wang, Yihang, Luo, Xinyuan, Sun, Yanjie, Zhang, Chen Jason

While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[\#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.

large language model, machine learning, natural language, (11 more...)

2511.10902

Country:

Asia > China (0.50)
Oceania > Australia (0.30)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)